test: Add unit tests to test multiple files in single dataset #412
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Thanks for making a pull request! 😃
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
This looks good to me; I like splitting test data into different folders by type. I have a couple of comments below. I'm interested to hear Dushyant's thoughts before merging.
[
    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
],
Would there be a benefit in using multiple types of data here such as JSONL for one dataset, Arrow for another, and Parquet for a third?
It may not be necessary, but it is good to have. Added!
Has this been added? I do not see the mix of different data types that @willmj asked for here.
Yes, this was added; it is visible in the commit changes here. Let me know if other changes are required in this unit test.
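For readers following the mixed-format discussion above, here is a minimal sketch of how a list of data files could be grouped by format before loading. The mapping, the function name `group_files_by_format`, and the file names are illustrative assumptions, not the project's actual dispatch logic.

```python
from collections import defaultdict
from pathlib import Path

# Illustrative mapping from file extension to a format label.
# This is an assumption for the sketch, not the project's real table.
EXTENSION_TO_FORMAT = {
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def group_files_by_format(data_paths):
    """Group paths by format label, keeping input order within each group."""
    groups = defaultdict(list)
    for path in data_paths:
        suffix = Path(path).suffix.lower()
        if suffix not in EXTENSION_TO_FORMAT:
            raise ValueError(f"Unsupported data file extension: {suffix!r}")
        groups[EXTENSION_TO_FORMAT[suffix]].append(path)
    return dict(groups)


# Hypothetical file names mixing the three formats mentioned in the review.
mixed = ["a.jsonl", "b.arrow", "c.parquet", "d.jsonl"]
grouped = group_files_by_format(mixed)
```

A test for a single dataset with mixed formats could then assert that every group is loaded and that the combined row count matches the sum over all files.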
        ),
    ],
)
def test_process_dataconfig_multiple_files(data_config_path, list_data_path):
Might be worth adding a test with three files just in case
As this test already has multiple cases, I added a case with 3 files to the same unit test for all 3 handlers.
We can reduce the number of test cases and possibly just have:
- Mix of all three -> 1 test
- Each dataset with multiple files, either two or three, maybe in a random mix
I think 3-4 scenarios should be fine; the rest are similar anyway.
Ah I see there is one with varied data formats below...so maybe just a reduction of number of tests here could also work.
I have optimized the unit tests based on the comments below. If we need to optimize further, do let me know here. @dushyantbehl
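As a rough illustration of the "two or three files per dataset" scenarios discussed in this thread, the sketch below writes several small JSONL files and checks that loading them together yields the rows of all files. It uses only the standard library; `write_jsonl`, `load_jsonl_files`, and `check_multiple_files` are hypothetical helpers, not functions from this repository, and the real tests go through the project's data config processing instead.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def load_jsonl_files(paths):
    """Concatenate the rows of several JSONL files, in the order given."""
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows


def check_multiple_files(num_files, rows_per_file=2):
    """Create num_files JSONL files and verify the combined row count."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = []
        for index in range(num_files):
            path = Path(tmp_dir) / f"part_{index}.jsonl"
            write_jsonl(
                path,
                [{"text": f"file {index} row {r}"} for r in range(rows_per_file)],
            )
            paths.append(path)
        rows = load_jsonl_files(paths)
    assert len(rows) == num_files * rows_per_file
    return rows


# Two scenarios instead of one test per combination.
for num_files in (2, 3):
    check_multiple_files(num_files)
```

Looping over a small set of scenarios like this mirrors the suggestion to keep the parametrization to a handful of representative cases.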
if dataset_text_field not in element:
    raise KeyError(f"Dataset should contain {dataset_text_field} field.")
Thanks for this!
Thanks for the catch!
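The guard quoted in this thread can be exercised in isolation. The sketch below wraps it in a hypothetical helper, `get_text_field`, which is an assumption for illustration and not the handler from the PR; only the `KeyError` check and its message come from the diff.

```python
def get_text_field(element, dataset_text_field):
    """Return the text field of a dataset element, failing early if absent."""
    if dataset_text_field not in element:
        raise KeyError(f"Dataset should contain {dataset_text_field} field.")
    return element[dataset_text_field]


# Hypothetical element that lacks the requested "input" field.
sample = {"output": "no complaint"}
try:
    get_text_field(sample, "input")
except KeyError as err:
    message = err.args[0]
```

Failing with an explicit message here is easier to debug than a bare `KeyError` raised later, deep inside a mapping function.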
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
        ),
    ],
)
def test_process_dataconfig_multiple_datasets_datafiles(datafiles, datasetconfigname):
This test case looks good. Updated the description. Thanks @willmj
Could this test be combined with this one?
def test_process_dataset_configs_with_sampling(datafiles, datasetconfigname):
Makes sense. Combined into test_process_dataconfig_multiple_datasets_datafiles_sampling.
.pylintrc
Outdated
@@ -333,7 +333,7 @@ indent-string=' '
 max-line-length=100

 # Maximum number of lines in a module.
-max-module-lines=1200
+max-module-lines=1400
Seems like we will keep hitting this. I also had to disable this specifically for test_sft_trainer.py
Sounds good. Removed it after merging in the change from the mentioned PR.
@@ -19,37 +19,47 @@

### Constants used for data
DATA_DIR = os.path.join(os.path.dirname(__file__))
JSON_DATA_DIR = os.path.join(os.path.dirname(__file__), "json")
Thanks for doing this segregation.
@@ -491,6 +497,284 @@ def test_process_dataconfig_file(data_config_path, data_path):
    assert formatted_dataset_field in set(train_set.column_names)


@pytest.mark.parametrize(
    "data_config_path, list_data_path",
nit: could we change the variable name list_data_path to data_path_list?
        ),
    ],
)
def test_process_dataconfig_multiple_files_varied_types(
Can this be combined with the above test?
Yea, this is added! Thank you!
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Thank you @dushyantbehl for the review. I have addressed the changes. Feel free to have another look.
LGTM @Abhishek-TAMU
LGTM
Description of the change
1- Added unit test case for verifying handling of multiple data files of 1 DataSetConfig passed via data_config.
2- Added unit test case for verifying handling of multiple data files of 1 DataSetConfig with (different format) passed via data_config.
3- Added unit test case for verifying handling of multiple data files of 1 DataSetConfig with (different type) passed via data_config.
4- Added unit test case (by @willmj) for verifying handling of multiple data files of multiple DataSetConfig with (different format) passed via data_config.
5- Data files in the tests/artifacts/testdata directory have been organized by file format for better categorization.
6- Unit tests in test_sft_trainer.py for e2e testing:
Related issue number
https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1487
How to verify the PR
Verify unit test additions.
Was the PR tested